

Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion

An Author Correction to this article was published on 19 October 2022

Abstract

Neural networks need the right representations of input data to learn. Here we ask how gradient-based learning shapes a fundamental property of representations in recurrent neural networks (RNNs)—their dimensionality. Through simulations and mathematical analysis, we show how gradient descent can lead RNNs to compress the dimensionality of their representations in a way that matches task demands during training while supporting generalization to unseen examples. This can require an expansion of dimensionality in early timesteps and compression in later ones, and strongly chaotic RNNs appear particularly adept at learning this balance. Beyond helping to elucidate the power of appropriately initialized artificial RNNs, this fact has implications for neurobiology as well. Neural circuits in the brain reveal both high variability associated with chaos and low-dimensional dynamical structures. Taken together, our findings show how simple gradient-based learning rules lead neural networks to solve tasks with robust representations that generalize to new cases.


Fig. 1: Task and model schematic.
Fig. 2: Dynamical and geometric properties of networks learning to classify high-dimensional inputs.
Fig. 3: Dynamical and geometric properties of networks learning to classify two-dimensional inputs.
Fig. 4: Dynamical and geometric properties of networks learning to classify two-dimensional inputs restricted to two neurons.
Fig. 5: Networks with mean squared error loss and linear units continue to compress dimensionality.
Fig. 6: Variability of gradients compresses neural representations in networks with injected hidden unit noise and regularization.

Data availability

All data used in the paper are generated by the code at ref. 61.

Code availability

Code for training the networks and generating the plots can be found in a Code Ocean capsule (ref. 61).

Change history

References

  1. Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326–334 (1965).

  2. Fusi, S., Miller, E. K. & Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Curr. Opin. Neurobiol. 37, 66–74 (2016).

  3. Vapnik, V. N. Statistical Learning Theory (Wiley-Interscience, 1998).

  4. Litwin-Kumar, A., Harris, K. D., Axel, R., Sompolinsky, H. & Abbott, L. F. Optimal degrees of synaptic connectivity. Neuron 93, 1153–1164 (2017).

  5. Cayco-Gajic, N. A., Clopath, C. & Silver, R. A. Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nat. Commun. 8, 1116 (2017).

  6. Wallace, C. S. & Boulton, D. M. An information measure for classification. Comput. J. 11, 185–194 (1968).

  7. Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).

  8. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).

  9. Ansuini, A., Laio, A., Macke, J. H. & Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Adv. Neural Inf. Process. Syst. 32, 11 (2019).

  10. Recanatesi, S. et al. Dimensionality compression and expansion in deep neural networks. Preprint at https://arxiv.org/abs/1906.00443 (2019).

  11. Cohen, U., Chung, S. Y., Lee, D. D. & Sompolinsky, H. Separability and geometry of object manifolds in deep neural networks. Nat. Commun. 11, 746 (2020).

  12. Jaeger, H. The ‘Echo State’ Approach to Analysing and Training Recurrent Neural Networks—with an Erratum Note. GMD Technical Report 148 (German National Research Center for Information Technology, 2001).

  13. Maass, W., Natschläger, T. & Markram, H. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput. 14, 2531–2560 (2002).

  14. Legenstein, R. & Maass, W. Edge of chaos and prediction of computational performance for neural circuit models. Neural Netw. 20, 323–334 (2007).

  15. Keup, C., Kühn, T., Dahmen, D. & Helias, M. Transient chaotic dimensionality expansion by recurrent networks. Phys. Rev. X 11, 021064 (2021).

  16. Vreeswijk, C. V. & Sompolinsky, H. Chaotic balanced state in a model of cortical circuits. Neural Comput. 10, 1321–1371 (1998).

  17. Litwin-Kumar, A. & Doiron, B. Slow dynamics and high variability in balanced cortical networks with clustered connections. Nat. Neurosci. 15, 1498–1505 (2012).

  18. Wolf, F., Engelken, R., Puelma-Touzel, M., Weidinger, J. D. F. & Neef, A. Dynamical models of cortical circuits. Curr. Opin. Neurobiol. 25, 228–236 (2014).

  19. Lajoie, G., Lin, K. & Shea-Brown, E. Chaos and reliability in balanced spiking networks with temporal drive. Phys. Rev. E 87, 2432–2437 (2013).

  20. London, M., Roth, A., Beeren, L., Häusser, M. & Latham, P. E. Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex. Nature 466, 123–127 (2010).

  21. Stam, C. J. Nonlinear dynamical analysis of EEG and MEG: review of an emerging field. Clin. Neurophysiol. 116, 2266–2301 (2005).

  22. Engelken, R. & Wolf, F. Dimensionality and entropy of spontaneous and evoked rate activity. In APS March Meeting Abstracts, Bull. Am. Phys. Soc. eP5.007 (2017).

  23. Kaplan, J. L. & Yorke, J. A. In Functional Differential Equations and Approximation of Fixed Points: Proceedings, Bonn, July 1978 204–227 (Springer, 1979).

  24. Sussillo, D. & Abbott, L. F. Generating coherent patterns of activity from chaotic neural networks. Neuron 63, 544–557 (2009).

  25. DePasquale, B., Cueva, C. J., Rajan, K., Escola, G. S. & Abbott, L. F. full-FORCE: a target-based method for training recurrent networks. PLoS ONE 13, e0191527 (2018).

  26. Stern, M., Olsen, S., Shea-Brown, E., Oganian, Y. & Manavi, S. In the footsteps of learning: changes in network dynamics and dimensionality with task acquisition. In Proc. COSYNE 2018, abstract no. III-100.

  27. Farrell, M. Revealing Structure in Trained Neural Networks Through Dimensionality-Based Methods. PhD thesis, Univ. Washington (2020).

  28. Rajan, K., Abbott, L. F. & Sompolinsky, H. Stimulus-dependent suppression of chaos in recurrent neural networks. Phys. Rev. E 82, 011903 (2010).

  29. Bell, R. J. & Dean, P. Atomic vibrations in vitreous silica. Discuss. Faraday Soc. 50, 55–61 (1970).

  30. Gao, P., Trautmann, E., Yu, B. & Santhanam, G. A theory of multineuronal dimensionality, dynamics and measurement. Preprint at bioRxiv https://doi.org/10.1101/214262 (2017).

  31. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).

  32. Goodfellow, I., Lee, H., Le, Q. V., Saxe, A. & Ng, A. Y. Measuring invariances in deep networks. Adv. Neural Inf. Process. Syst. 22, 646–654 (2009).

  33. Lajoie, G., Lin, K. K., Thivierge, J.-P. & Shea-Brown, E. Encoding in balanced networks: revisiting spike patterns and chaos in stimulus-driven systems. PLoS Comput. Biol. 12, e1005258 (2016).

  34. Huang, H. Mechanisms of dimensionality reduction and decorrelation in deep neural networks. Phys. Rev. E 98, 062313 (2018).

  35. Kadmon, J. & Sompolinsky, H. Optimal architectures in a solvable model of deep networks. Adv. Neural Inf. Process. Syst. 29, 4781–4789 (2016).

  36. Papyan, V., Han, X. Y. & Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl Acad. Sci. USA 117, 24652–24663 (2020).

  37. Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. Preprint at https://arxiv.org/abs/1703.00810 (2017).

  38. Shwartz-Ziv, R., Painsky, A. & Tishby, N. Representation compression and generalization in deep neural networks. Preprint at OpenReview (2019).

  39. Babadi, B. & Sompolinsky, H. Sparseness and expansion in sensory representations. Neuron 83, 1213–1226 (2014).

  40. Marr, D. A theory of cerebellar cortex. J. Physiol. 202, 437–470.1 (1969).

  41. Albus, J. S. A theory of cerebellar function. Math. Biosci. 10, 25–61 (1971).

  42. Stringer, C., Pachitariu, M., Steinmetz, N., Carandini, M. & Harris, K. D. High-dimensional geometry of population responses in visual cortex. Nature 571, 361–365 (2019).

  43. Mazzucato, L., Fontanini, A. & La Camera, G. Stimuli reduce the dimensionality of cortical activity. Front. Syst. Neurosci. 10, 11 (2016).

  44. Rosenbaum, R., Smith, M. A., Kohn, A., Rubin, J. E. & Doiron, B. The spatial structure of correlated neuronal variability. Nat. Neurosci. 20, 107–114 (2017).

  45. Landau, I. D. & Sompolinsky, H. Coherent chaos in a recurrent neural network with structured connectivity. PLoS Comput. Biol. 14, e1006309 (2018).

  46. Huang, C. et al. Circuit models of low-dimensional shared variability in cortical networks. Neuron 101, 337–348.e4 (2019).

  47. Mastrogiuseppe, F. & Ostojic, S. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron 99, 609–623.e29 (2018).

  48. Mazzucato, L., Fontanini, A. & La Camera, G. Dynamics of multistable states during ongoing and evoked cortical activity. J. Neurosci. 35, 8214–8231 (2015).

  49. Cunningham, J. P. & Yu, B. M. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci. 17, 1500–1509 (2014).

  50. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016); http://www.deeplearningbook.org

  51. Faisal, A. A., Selen, L. P. J. & Wolpert, D. M. Noise in the nervous system. Nat. Rev. Neurosci. 9, 292–303 (2008).

  52. Freedman, D. J. & Assad, J. A. Experience-dependent representation of visual categories in parietal cortex. Nature 443, 85–88 (2006).

  53. Dangi, S., Orsborn, A. L., Moorman, H. G. & Carmena, J. M. Design and analysis of closed-loop decoder adaptation algorithms for brain–machine interfaces. Neural Comput. 25, 1693–1731 (2013).

  54. Orsborn, A. L. & Pesaran, B. Parsing learning in networks using brain–machine interfaces. Curr. Opin. Neurobiol. 46, 76–83 (2017).

  55. Recanatesi, S. et al. Predictive learning as a network mechanism for extracting low-dimensional latent space representations. Nat. Commun. 12, 1417 (2021).

  56. Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429 (2018).

  57. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations https://doi.org/10.48550/arXiv.1609.04836 (2017).

  58. Advani, M. S., Saxe, A. M. & Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Netw. 132, 428–446 (2020).

  59. Li, Y. & Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. Adv. Neural Inf. Process. Syst. 31 (2018).

  60. Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/abs/1506.00019 (2015).

  61. Farrell, M. Gradient-based learning drives robust representations in RNNs by balancing compression and expansion. Code Ocean https://doi.org/10.24433/CO.5101546.v1 (2022).

Acknowledgements

M.F. was funded by the National Science Foundation Graduate Research Fellowship under Grant DGE-1256082. G.L. is funded by an NSERC Discovery Grant (RGPIN-2018-04821), an FRQNT Young Investigator Startup Program (2019-NC-253251) and an FRQS Research Scholar Award, Junior 1 (LAJGU0401-253188). E.S.-B. acknowledges the support of NSF DMS Grant 1514743. M.F. thanks the Swartz Program in Theoretical Neuroscience at Harvard and S.R. thanks the Swartz Center for Theoretical Neuroscience at the University of Washington for support. We thank M. Stern, D. Chklovskii, A. Weber, N. Steinmetz and L. Mazzucato for their insights and suggestions. M.F. would also like to thank H. Sompolinsky and S. Chung for their mentorship and inspiration.

Author information

Authors and Affiliations

Authors

Contributions

M.F., S.R. and E.S.-B. conceived the study. M.F. wrote code and ran simulations with some guidance from S.R. The manuscript was primarily written by M.F., with substantial edits and contributions made by S.R., G.L. and E.S.-B. G.L. contributed code for computing Lyapunov exponents and provided additional insight. T.M. ran the simulations for Extended Data Fig. 9 and ran additional verification experiments for intermediate values of β.

Corresponding author

Correspondence to Matthew Farrell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Cristina Savin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Effects of changing the evaluation timestep and number of recurrent units.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify high-dimensional inputs. Details are as in Fig. 2e. Shaded regions are as defined in Fig. 2e. First row: network trained with a categorical cross-entropy loss with a learning rate of 1e-4. Second row: network trained with a mean squared error loss with a learning rate of 1e-3. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: Number of hidden neurons is increased to N = 300. Evaluation time is t = 14.

Extended Data Fig. 2 Effects of changing the evaluation timestep, input dimension, and number of recurrent units.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. All networks are trained with a categorical cross-entropy loss and a learning rate of 1e-4 (note that this is a factor of 10 less than used in the main text). Shaded regions are as defined in Fig. 2e. Other details are as in Fig. 3e. First row: 2-dimensional inputs. Second row: 4-dimensional inputs. Third row: 10-dimensional inputs. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: number of hidden neurons is increased to N = 300. Evaluation time is t = 14.

Extended Data Fig. 3 Effects of changing the evaluation timestep, input dimension, and number of recurrent units on logistic regression testing accuracy.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. Details are as in Extended Data Fig. 2, but here we measure the logistic regression testing accuracy as defined in the main text.
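
The logistic regression testing accuracy reported in these panels can be computed along the following lines. This is a minimal sketch rather than the authors' code: the data arrays are hypothetical stand-ins, and the solver and regularization settings used in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def logistic_readout_accuracy(hidden_states, labels, seed=0):
    """Fit a linear readout on hidden states and return held-out accuracy.

    hidden_states: (n_trials, n_units) array, e.g. RNN activity at the
        evaluation timestep.
    labels: (n_trials,) array of integer class labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)  # fraction of held-out trials correct

# Example with random stand-in data (200 units, 60 class labels):
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 200))
y = rng.integers(0, 60, size=600)
print(logistic_readout_accuracy(X, y))
```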

Extended Data Fig. 4 Effects of changing the evaluation timestep, input dimension, and number of recurrent units with 120 input clusters.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. Details are as in Extended Data Fig. 2, but here 120 input clusters are used instead of 60.

Extended Data Fig. 5 Between-class distances are increased while within-class distances are diminished by the network dynamics.

Mean pairwise distance between points belonging to the same class (dashed lines), mean pairwise distance between points belonging to different classes (dotted lines), and the ratio of the first to the second (blue lines and axes), for the representations of trained networks over time t. Details are as in Fig. 2e in the main text. Shaded regions are as defined in Fig. 2e. a. Edge-of-chaos network as defined in the main text. b. Strongly chaotic network as defined in the main text.
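
The within- and between-class distance statistics plotted here can be computed directly from the stored representations at each timestep. A minimal sketch, assuming the representations are held in a points-by-units array (names and interface are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def class_distance_stats(X, labels):
    """Mean pairwise distances within and between classes, plus their ratio.

    X: (n_points, n_units) network representations at one timestep.
    labels: (n_points,) integer class labels.
    """
    classes = np.unique(labels)
    # All pairwise distances between points sharing a class label.
    within = np.concatenate(
        [pdist(X[labels == c]) for c in classes if (labels == c).sum() > 1])
    # All pairwise distances between points from different classes.
    between = np.concatenate(
        [cdist(X[labels == a], X[labels == b]).ravel()
         for i, a in enumerate(classes) for b in classes[i + 1:]])
    # A ratio below 1 indicates clustering by class, as seen after training.
    return within.mean(), between.mean(), within.mean() / between.mean()
```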

Extended Data Fig. 6 Dependence of dimensionality on the learning rate.

Here we reproduce the results of Figs. 2e and 3e of the main text, using a different learning rate. Red lines correspond to strongly chaotic and cyan lines to edge-of-chaos networks. Dashed and solid lines depict before and after training, respectively. Shaded regions are as defined in Fig. 2e. Top row: high-dimensional inputs as in Fig. 2e. Bottom row: low-dimensional inputs as in Fig. 3e. Left column: learning rate of 1e-4. Right column: learning rate of 1e-3, as in the main text.

Extended Data Fig. 7 Dimensionality increases with number of class labels, but not number of clusters.

Effective dimensionalities (EDs) of the trained network responses to inputs embedded in an N-dimensional space, measured at the evaluation time teval = 10. Error bars denote two standard deviations across three initializations of the task and networks (in all panels they are too small to see). Details are similar to Fig. 2. a. Edge-of-chaos networks. Blue: ED of the inputs. Green: ED of the network representation as a function of the number of input clusters. Dimensionality remains flat and small. b. Edge-of-chaos networks. Green: ED of the network representation as a function of the number of class labels. Black: effective dimensionality of points distributed uniformly at random in an N-dimensional ball, with the number of points drawn equal to the number of class labels. This gives a rough estimate of the ED the network would have if it formed a fixed point for every class label and distributed these fixed points randomly in space. c. Strongly chaotic networks. Legend as in a. d. Strongly chaotic networks. Legend as in b.
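
The effective dimensionality used throughout is, we assume, the participation ratio of the eigenvalues of the response covariance; the paper's exact estimator may differ in detail. A minimal sketch, including the uniform-ball baseline corresponding to the black curves:

```python
import numpy as np

def effective_dimensionality(X):
    """Participation ratio ED = (sum_i l_i)^2 / sum_i l_i^2, where l_i are
    the eigenvalues of the covariance of X, shape (n_points, n_dims).
    ED equals N for isotropic N-dimensional data and approaches 1 when
    variance concentrates along a single direction."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return lam.sum() ** 2 / (lam ** 2).sum()

def random_points_in_ball(n_points, n_dims, rng):
    """Points drawn uniformly at random from the unit N-dimensional ball."""
    g = rng.normal(size=(n_points, n_dims))
    g /= np.linalg.norm(g, axis=1, keepdims=True)          # uniform directions
    r = rng.uniform(size=(n_points, 1)) ** (1.0 / n_dims)  # radii for uniform ball
    return g * r

# Baseline ED for as many random points as there are class labels:
rng = np.random.default_rng(0)
print(effective_dimensionality(random_points_in_ball(60, 200, rng)))
```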

Extended Data Fig. 8 Example of noise in output weights driving compression of the hidden representation in a linear network with two hidden layer units.

The equation for the network is \(\mathbf{h} = W\mathbf{x} + \mathbf{b}\) with output \(\hat{o} = \mathbf{r}^T\mathbf{h}\). The input weights (red) are initialized to the 2 × 2 identity matrix, and the bias is initialized as (1, 0). The inputs are placed on a grid from x = −1 to x = 2 and from y = −3 to y = 3 (not shown). Network output \(\hat{o}\) is trained to minimize the squared error loss \(0.5(\hat{o}-1)^2\). Input samples are chosen randomly, and input weights are updated via stochastic gradient descent with batch size 1. a. Top: diagram of the network where input weights are trained and output weights are fixed. Bottom: diagram of the network where input weights are trained and output weights are drawn from a normal distribution with mean (1, 0) and covariance 0.05I at every update step. In the figure, η represents additive white noise. Middle: hidden unit responses (blue circles) to the inputs before training (iteration 0). The black dot denotes the output weight vector, and the blue line is the affine subspace of points that \(\mathbf{r}\) maps to 1. b. Evolution of the hidden layer response to inputs (representation) as input weights are trained. Top: representation of the network where output weights are fixed. The iteration number denotes the number of training samples that have been used to update the weights. Activations compress to the space orthogonal to \(\mathbf{r}\), shifted by (1, 0). Bottom: representation of the network where output weights are randomly drawn at every input sample presentation. Activations compress to a compact, localized space. The direction of compression is both along and orthogonal to \(\mathbf{r}\).
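
The toy model in this caption is fully specified and straightforward to reproduce. Below is a minimal sketch; the learning rate, number of steps, and grid resolution are our own assumptions, as the caption does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D inputs on a grid, x in [-1, 2] and y in [-3, 3], as in the caption.
gx, gy = np.meshgrid(np.linspace(-1, 2, 10), np.linspace(-3, 3, 10))
inputs = np.stack([gx.ravel(), gy.ravel()], axis=1)

def train(noisy_readout, n_steps=5000, lr=0.01, noise_var=0.05):
    W = np.eye(2)               # input weights: 2 x 2 identity
    b = np.array([1.0, 0.0])    # bias: (1, 0)
    r_mean = np.array([1.0, 0.0])
    for _ in range(n_steps):
        x = inputs[rng.integers(len(inputs))]   # one random sample (batch 1)
        r = (r_mean + np.sqrt(noise_var) * rng.normal(size=2)
             if noisy_readout else r_mean)      # resample output weights
        err = r @ (W @ x + b) - 1.0             # derivative of 0.5*(o - 1)^2
        W -= lr * np.outer(err * r, x)          # SGD step on input weights
        b -= lr * err * r
    return inputs @ W.T + b                     # hidden responses h = Wx + b

H_fixed = train(noisy_readout=False)
H_noisy = train(noisy_readout=True)
# Fixed r leaves variance orthogonal to r intact; resampled r compresses
# the representation in all directions, as in the bottom row of panel b.
print(H_fixed.var(axis=0), H_noisy.var(axis=0))
```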

Extended Data Fig. 9 Representations of RNNs trained on the MNIST digit recognition dataset.

a. Effective dimensionality (ED) of RNNs trained on the MNIST digit recognition dataset. The ED of the network’s responses to test inputs is plotted. After training, dimensionality compresses down to a value by t = 10 that roughly matches the number of class labels (10). This compression is similar to that seen in Fig. 2 of the main text. Details are as in Fig. 2e of the main text. Shaded regions are as defined in Fig. 2e. b. Projection onto the top three principal components of MNIST test data. Colours indicate true class label (i.e., digit identity). c. Projection onto the top three principal components of the edge-of-chaos recurrent network’s responses to the inputs in b after training, at the evaluation time t = 10. Colours indicate true class label as in b. The network forms a localized cluster for each digit.
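
The projections in panels b and c can be produced with plain SVD-based principal component analysis. A minimal sketch (variable names are illustrative; the actual plotting code is in the Code Ocean capsule, ref. 61):

```python
import numpy as np

def top3_pca_projection(responses):
    """Project data onto its top three principal components.

    responses: (n_samples, n_features) array, e.g. flattened MNIST test
    images (panel b) or RNN hidden states at t = 10 (panel c).
    """
    centered = responses - responses.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:3].T  # (n_samples, 3) coordinates to plot

# Shape check with random stand-in data:
proj = top3_pca_projection(np.random.default_rng(0).normal(size=(500, 64)))
print(proj.shape)  # (500, 3)
```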

Extended Data Fig. 10 Effects of changing the initial coupling strength β.

Details are as in Fig. 3e, except that here we vary the coupling parameter β, whose value is indicated by the colourbar to the right. Shaded regions are as defined in Fig. 2e. First column: ED of networks before training. Second column: ED of networks after training with a learning rate of 1e-3. Third column: ED of networks after training with a learning rate of 1e-4.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farrell, M., Recanatesi, S., Moore, T. et al. Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nat Mach Intell 4, 564–573 (2022). https://doi.org/10.1038/s42256-022-00498-0

